Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests as a deterioration of movement, including tremors and stiffness. Speech is commonly affected, with dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone speech (reduced pitch range). Cognitive impairments and mood changes can also occur, and the risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test for PD, diagnosis is often difficult, particularly in the early stages when motor symptoms are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits. An effective screening process, particularly one that does not require a clinic visit, would therefore be valuable. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could accurately diagnose PD from a voice-recording dataset, this would provide an effective screening step prior to an appointment with a clinician.
The dataset (the UCI Parkinson's voice dataset) contains the following attributes:
- name - ASCII subject name and recording number
- MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz) - average, maximum, and minimum vocal fundamental frequency
- MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - measures of variation in fundamental frequency
- MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - measures of variation in amplitude
- NHR, HNR - two measures of the ratio of noise to tonal components in the voice
- RPDE, D2 - two nonlinear dynamical complexity measures
- DFA - signal fractal scaling exponent
- spread1, spread2, PPE - three nonlinear measures of fundamental frequency variation
- status - health status of the subject: one (1) = Parkinson's, zero (0) = healthy
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import pearsonr
from sklearn import metrics
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc, classification_report, recall_score, precision_score, f1_score
from sklearn.model_selection import cross_val_score
import pandas_profiling
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")
#reading the data
df = pd.read_csv("Data - Parkinsons")
df.head()
df.shape
df.dtypes
df.astype({'status': 'category'}).dtypes  # note: returns a converted copy; df itself is unchanged
df.isnull().sum()  # .isna() is an alias of .isnull(), so one check suffices
df.describe().T
For most attributes, the mean and median differ noticeably, which suggests skewed distributions. We examine the skewness of each column below.
All the values in spread1 are negative. We will scale the features later for better model building.
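Before plotting each column individually, the skewness of every numeric column can be computed in one call. A minimal sketch on synthetic data (the column names here are illustrative, not from the Parkinson's dataset):

```python
import numpy as np
import pandas as pd

# pandas computes skewness for every numeric column at once.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=1000),         # skew near 0
    "right_skewed": rng.exponential(size=1000)  # positive skew
})
skews = toy.skew()
print(skews)
```

On the real data, `df.drop(columns="status").skew()` would give the same per-column summary in a single line.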
plt.figure(figsize=(5,5))
plt.style.use("ggplot")
plt.title("Output Class (status) Distribution")
plt.xticks([0,1])
sns.countplot(x="status", data=df);
plt.show();
n_true = len(df.loc[df["status"] == 1])
n_false = len(df.loc[df["status"] == 0])
print("Number of positive cases: {0} ({1:2.2f}%)".format(n_true, (n_true / (n_true + n_false)) * 100 ))
print("Number of negative cases: {0} ({1:2.2f}%)".format(n_false, (n_false / (n_true + n_false)) * 100))
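A compact alternative to the manual counts above is `value_counts(normalize=True)`, which reports the class shares directly. Sketch with toy labels (not the notebook's data):

```python
import pandas as pd

# Toy status labels: six positives, two negatives.
status = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])
shares = status.value_counts(normalize=True)
print(shares)
```

With classes this imbalanced, passing `stratify=y` to `train_test_split` would preserve the class ratio in both splits.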
for col in df.columns:
    print(col)
df = df.drop("name", axis=1)
df.head()
sns.pairplot(df)
#loop to plot all numerical attributes
for i, col in enumerate(df.drop("status", axis=1).columns):
    plt.style.use('seaborn-pastel')
    # create a subplot with 2 windows: one boxplot, one histogram
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=(12,6))
    sns.boxplot(df[col], ax=ax_box, color="indianred")
    sns.distplot(df[col], ax=ax_hist, color="indianred")
    # vertical lines for median and mean
    plt.axvline(np.median(df[col]), color='darkblue', linestyle='-.', label="Median", lw=3)
    plt.axvline(np.mean(df[col]), color='green', linestyle=':', label="Mean", lw=3)
    ax_box.set(xlabel='')
    ax_hist.set(xlabel=col)
    plt.legend(loc="lower right")
    x = df[col]
    mu = x.mean()
    median = np.median(x)
    sigma = x.std()
    textstr = '\n'.join((r'$\mu=%.5f$' % (mu, ),
                         r'$\mathrm{median}=%.5f$' % (median, ),
                         r'$\sigma=%.5f$' % (sigma, )))
    # place a text box in the upper right in axes coords
    props = dict(boxstyle='round', facecolor='wheat', alpha=1)
    ax_hist.text(0.70, 0.95, textstr, transform=ax_hist.transAxes, fontsize=14,
                 verticalalignment='top', bbox=props)
    plt.show()
    # skewness
    skewness = stats.skew(df[col])
    if abs(skewness) < 0.5:
        print(f"Skewness of {col} is: {round(skewness, 3)}, hence the distribution is fairly normal.")
    elif skewness > 0.5:
        print(f"Skewness of {col} is: {round(skewness, 3)}, hence it is right skewed.")
    else:
        print(f"Skewness of {col} is: {round(skewness, 3)}, hence it is left skewed.")
    print("------------------------------------------------------------------------")
plt.figure(figsize=(30,20))
sns.heatmap(df.corr(), annot=True, linewidths=.1, fmt= '.1f', center = 1 ) # heatmap
plt.show()
# High correlation visualization
# Absolute values only.
cutoff = 0.9  # only pairs with absolute correlation >= cutoff are kept
feature_corr = df.drop('status', axis=1).corr()
hi_corr_df = feature_corr[abs(feature_corr) >= cutoff].round(2)
plt.figure(figsize=(30,20))
sns.heatmap(hi_corr_df, annot=True, linewidths=.1, fmt='.1f', center=1)  # heatmap
plt.show()
The variables highlighted above are very highly correlated (absolute correlation ≥ 0.9).
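A common follow-up is to drop one feature from each highly correlated pair, using the upper triangle of the absolute correlation matrix so each pair is considered only once. A hedged sketch on toy columns (not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: "a_copy" is a linear transform of "a", so |corr| = 1.0.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "a_copy": a * 2 + 0.1, "b": rng.normal(size=200)})

cutoff = 0.9
corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] >= cutoff).any()]
pruned = toy.drop(columns=to_drop)
print(to_drop)  # columns flagged for removal
```

The same pattern applied to `df` would prune the redundant jitter/shimmer variants flagged in the heatmap, though whether to prune is a modeling choice.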
for i, col in enumerate(df.drop(labels=['status'], axis=1).columns):
    plt.figure(i)
    plt.xlabel('status')
    plt.ylabel(col)
    sns.swarmplot(y=col, data=df, x="status")
    corr, _ = pearsonr(df['status'], df[col])
    plt.title(f"Status v/s {col}", fontsize=15)
    plt.show()
    print(f"Correlation between Status and {col} is {round(corr, 2)}")
    print("___________________________________________________")
X = df.drop(labels=['status'], axis=1) #dropping target variable
X.head(10)
y = df["status"] #target variable in y
pd.DataFrame(y).head()
#splitting in a 70:30 ratio
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#scaling the values
scaler = MinMaxScaler(feature_range=(0,1)) #scale of 0-1
##min-max scaling
x_train = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train, columns=X.columns)
x_train.head() #train values after scaling
x_test = scaler.transform(x_test)
x_test = pd.DataFrame(x_test, columns=X.columns)
x_test.head() #test values after scaling
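The key point in the cells above is that the scaler is fit on the training split only, then reused on the test split, so test-set statistics never leak into the scaling parameters. A minimal self-contained sketch on random data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data (shape and values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns per-column min/max from train only
X_te_s = scaler.transform(X_te)      # reuses the train min/max

print(X_tr_s.min(), X_tr_s.max())    # train spans [0, 1] exactly
```

Note that the scaled test set may fall slightly outside [0, 1] when a test value exceeds the training range; that is expected and harmless for these models.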
from sklearn.linear_model import LogisticRegression
# Fit the model on training set
logistic = LogisticRegression(solver="liblinear")
logistic.fit(x_train, y_train)
#predict on test
y_predict = logistic.predict(x_test)
y_predict ##our predicted values on test set
logistic_score_train = logistic.score(x_train, y_train)
print(f"Train accuracy for logistic regression is {logistic_score_train}")
logistic_score_test = logistic.score(x_test, y_test)
print(f"Test accuracy for logistic regression is {logistic_score_test}")
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
LR_precision = precision_score(y_test, y_predict)
print(f"Logistic Regression - Precision: {LR_precision}")
LR_recall = recall_score(y_test, y_predict)
print(f"Logistic Regression - Recall: {LR_recall}")
LR_f1 = f1_score(y_test, y_predict)
print(f"Logistic Regression - F1 Score: {LR_f1}")
print(metrics.classification_report(y_test, y_predict, labels=[1, 0]))
LRprob=logistic.predict_proba(x_test)
fpr1, tpr1, thresholds1 = roc_curve(y_test, LRprob[:, 1])
roc_aucLR = auc(fpr1, tpr1)
print("Area under the ROC curve : %f" % roc_aucLR)
#ROC Curve for logistic regression
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr1,tpr1)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
LR = pd.DataFrame({'Accuracy': [logistic_score_test],
'Precision': [LR_precision],
'Recall': [LR_recall],
'F1 Score': [LR_f1],
'AUC':[roc_aucLR]},index=["Logistic Regression"])
LR
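As an aside, `roc_auc_score` computes the AUC in a single call without explicitly building the curve via `roc_curve` and `auc`. A sketch with toy labels and probabilities (illustrative values, not model output):

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75
```

The two-step `roc_curve` + `auc` route used in this notebook is equivalent and is still needed when you also want to plot the curve.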
from sklearn.naive_bayes import GaussianNB # using Gaussian algorithm from Naive Bayes
from sklearn.naive_bayes import BernoulliNB # using Bernoulli algorithm from Naive Bayes
# create the model
gauss = GaussianNB()
bern = BernoulliNB()
gauss.fit(x_train, y_train)
print(f"Train Accuracy for GaussianNB: {gauss.score(x_train, y_train)}")
bern.fit(x_train, y_train)
print(f"Train Accuracy for BernoulliNB: {bern.score(x_train, y_train)}")
y_pred_gauss = gauss.predict(x_test)
y_pred_bern = bern.predict(x_test)
test_score_NB = accuracy_score(y_test, y_pred_bern)
print(f"Test Accuracy for GaussianNB: {gauss.score(x_test, y_test)}")
print(f"Test Accuracy for BernoulliNB: {test_score_NB}")
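One caveat worth noting: `BernoulliNB` models binary features and, with the default `binarize=0.0`, maps every strictly positive value to 1. After min-max scaling, almost every value is positive, so the continuous features lose most of their information unless the threshold is raised. A toy sketch (the data below is made up to show the effect):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Two continuous features in [0, 1]; class 1 has high f1 and low f2.
X = np.array([[0.0, 0.9], [0.1, 0.8], [0.7, 0.1], [0.9, 0.0]])
y = np.array([0, 0, 1, 1])

default_nb = BernoulliNB().fit(X, y)            # binarize=0.0: any value > 0 becomes 1
custom_nb = BernoulliNB(binarize=0.5).fit(X, y)  # split each feature at 0.5 instead
print(custom_nb.predict([[0.8, 0.2]]))           # predicts class 1
```

Choosing `binarize` near each feature's median (or using `GaussianNB`, which handles continuous features directly) is usually a better fit for scaled data like this.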
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_bern, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
NB_precision = precision_score(y_test, y_pred_bern)
print(f"Naive Bayes - Precision: {NB_precision}")
NB_recall = recall_score(y_test, y_pred_bern)
print(f"Naive Bayes - Recall: {NB_recall}")
NB_f1 = f1_score(y_test, y_pred_bern)
print(f"Naive Bayes - F1 Score: {NB_f1}")
print(metrics.classification_report(y_test, y_pred_bern, labels=[1, 0]))
NBprob=bern.predict_proba(x_test)
fpr2, tpr2, thresholds2 = roc_curve(y_test, NBprob[:, 1])
roc_aucNB = auc(fpr2, tpr2)
print("Area under the ROC curve : %f" % roc_aucNB)
#ROC Curve for NB
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr2,tpr2)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for Naive Bayes')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
NB = pd.DataFrame({'Accuracy': [test_score_NB],
'Precision': [NB_precision],
'Recall': [NB_recall],
'F1 Score': [NB_f1],
'AUC':[roc_aucNB]},index=["Naive Bayes"])
NB
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors = 3)
# fitting the model
knn.fit(x_train, y_train)
# predict the response
y_pred = knn.predict(x_test)
# evaluate accuracy
print(f"Score for k=3: {accuracy_score(y_test, y_pred)}")
# instantiate learning model (k = 5)
knn = KNeighborsClassifier(n_neighbors=5)
# fitting the model
knn.fit(x_train, y_train)
# predict the response
y_pred = knn.predict(x_test)
# evaluate accuracy
print(f"Score for k=5: {accuracy_score(y_test, y_pred)}")
# instantiate learning model (k = 9)
knn = KNeighborsClassifier(n_neighbors=9)
# fitting the model
knn.fit(x_train, y_train)
# predict the response
y_pred = knn.predict(x_test)
# evaluate accuracy
print(f"Score for k=9: {accuracy_score(y_test, y_pred)}")
knn = KNeighborsClassifier(n_neighbors = 11)
# fitting the model
knn.fit(x_train, y_train)
# predict the response
y_pred = knn.predict(x_test)
# evaluate accuracy
print(f"Score for k=11: {accuracy_score(y_test, y_pred)}")
knn = KNeighborsClassifier(n_neighbors = 9)
# fitting the model
knn.fit(x_train, y_train)
train_score_knn = knn.score(x_train, y_train)
print(f"Train Score for KNN: {train_score_knn}")
# predict the response
y_pred = knn.predict(x_test)
test_score_knn = knn.score(x_test, y_test)
# evaluate test accuracy
print(f"Test Score for KNN: {test_score_knn}")
kNN with k = 9 achieves an accuracy of about 94.9% on the test set.
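Rather than repeating the fit/score block for each k by hand, a loop with cross-validation can pick k directly. A sketch on synthetic data (the candidate values mirror those tried above; on the real data `x_train`/`y_train` would drop in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; odd k values avoid voting ties in binary problems.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
candidate_ks = [3, 5, 7, 9, 11]
cv_means = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}
best_k = max(cv_means, key=cv_means.get)
print(best_k, cv_means[best_k])
```

Selecting k by cross-validation on the training split also avoids tuning against the test set, which the manual comparison above implicitly does.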
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
knn_precision = precision_score(y_test, y_pred)
print(f"kNN - Precision: {knn_precision}")
knn_recall = recall_score(y_test, y_pred)
print(f"kNN - Recall: {knn_recall}")
knn_f1 = f1_score(y_test, y_pred)
print(f"kNN - F1 Score: {knn_f1}")
print(metrics.classification_report(y_test, y_pred, labels=[1, 0]))
# ROC curve and area under the curve for kNN
knnprob=knn.predict_proba(x_test)
fpr3, tpr3, thresholds4 = roc_curve(y_test, knnprob[:, 1])
roc_aucknn = auc(fpr3, tpr3)
print("Area under the ROC curve : %f" % roc_aucknn)
#ROC Curve for kNN
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr3,tpr3)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for kNN')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
knn = pd.DataFrame({'Accuracy': [test_score_knn],
'Precision': [knn_precision],
'Recall': [knn_recall],
'F1 Score': [knn_f1],
'AUC':[roc_aucknn]},index=["kNN"])
knn
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50, random_state=1,max_features=12)
rfcl = rfcl.fit(x_train, y_train)
y_pred = rfcl.predict(x_test)
test_Score_RF = rfcl.score(x_test, y_test)
print(f"Random Forest Test score: {test_Score_RF}")
cm=confusion_matrix(y_test, y_pred,labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
RF_precision = precision_score(y_test, y_pred)
print(f"Random Forest - Precision: {RF_precision}")
RF_recall = recall_score(y_test, y_pred)
print(f"Random Forest - Recall: {RF_recall}")
RF_f1 = f1_score(y_test, y_pred)
print(f"Random Forest - F1 Score: {RF_f1}")
print(metrics.classification_report(y_test, y_pred, labels=[1, 0]))
# ROC curve and area under the curve for RandomForest
RFprob=rfcl.predict_proba(x_test)
fpr4, tpr4, thresholds4 = roc_curve(y_test, RFprob[:, 1])
roc_aucRF = auc(fpr4, tpr4)
print("Area under the ROC curve : %f" % roc_aucRF)
#ROC Curve for RandomForest
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr4,tpr4)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for Random Forest')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
RF = pd.DataFrame({'Accuracy': [test_Score_RF],
'Precision': [RF_precision],
'Recall': [RF_recall],
'F1 Score': [RF_f1],
'AUC':[roc_aucRF]},index=["Random Forest"])
RF
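Random forests also expose `feature_importances_`, which is useful here for seeing which vocal measures the model leans on. A sketch on synthetic data (on the real model, `index=X.columns` would label the importances):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; impurity-based importances sum to 1.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
importances = pd.Series(rf.feature_importances_,
                        index=[f"f{i}" for i in range(X.shape[1])])
print(importances.sort_values(ascending=False))
```

Note that impurity-based importances can be inflated for correlated features, which matters for this dataset given the highly correlated jitter/shimmer columns.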
from sklearn.ensemble import BaggingClassifier
bgcl = BaggingClassifier(n_estimators=50,random_state=1)
bgcl = bgcl.fit(x_train, y_train)
y_pred = bgcl.predict(x_test)
BC_test_score = bgcl.score(x_test , y_test)
print(f"The score on test data for Bagging is {BC_test_score}")
cm=confusion_matrix(y_test, y_pred,labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
BC_precision = precision_score(y_test, y_pred)
print(f"Bagging Classifier - Precision: {BC_precision}")
BC_recall = recall_score(y_test, y_pred)
print(f"Bagging Classifier - Recall: {BC_recall}")
BC_f1 = f1_score(y_test, y_pred)
print(f"Bagging Classifier - F1 Score: {BC_f1}")
print(metrics.classification_report(y_test,y_pred,labels=[0,1]))
# ROC curve and area under the curve for BaggingClassifier
baggingcl=bgcl.predict_proba(x_test)
fpr5, tpr5, thresholds5 = roc_curve(y_test, baggingcl[:, 1])
roc_aucBC = auc(fpr5, tpr5)
print("Area under the ROC curve : %f" % roc_aucBC)
#ROC Curve for BaggingClassifier
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr5,tpr5)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for Bagging Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
BC = pd.DataFrame({'Accuracy': [BC_test_score],
'Precision': [BC_precision],
'Recall': [BC_recall],
'F1 Score': [BC_f1],
'AUC':[roc_aucBC]},index=["Bagging Classifier"])
BC
xgboost = XGBClassifier()
xgboost.fit(x_train, y_train)
y_pred = xgboost.predict(x_test)
y_pred
xgboost_score_test = xgboost.score(x_test, y_test)
print(f"Test score for xgboost is {xgboost_score_test}")
print(classification_report(y_test, y_pred))
cm=confusion_matrix(y_test, y_pred,labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
xgb_precision = precision_score(y_test, y_pred)
print(f"XGBoost - Precision: {xgb_precision}")
xgb_recall = recall_score(y_test, y_pred)
print(f"XGBoost - Recall: {xgb_recall}")
xgb_f1 = f1_score(y_test, y_pred)
print(f"XGBoost - F1 Score: {xgb_f1}")
print(metrics.classification_report(y_test,y_pred,labels=[0,1]))
# ROC curve and area under the curve for XGBoost
xgboostProb=xgboost.predict_proba(x_test)
fpr6, tpr6, thresholds6 = roc_curve(y_test, xgboostProb[:, 1])
roc_aucxgb = auc(fpr6, tpr6)
print("Area under the ROC curve : %f" % roc_aucxgb)
#ROC Curve for XGBoost
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr6,tpr6)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for XGBoost')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
xgb = pd.DataFrame({'Accuracy': [xgboost_score_test],
'Precision': [xgb_precision],
'Recall': [xgb_recall],
'F1 Score': [xgb_f1],
'AUC':[roc_aucxgb]},index=["XGBoost"])
xgb
from mlxtend.classifier import StackingClassifier
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = LogisticRegression()
clf3 = BernoulliNB()
clf4 = RandomForestClassifier(random_state=1)
rf = RandomForestClassifier(random_state=1)
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3, clf4], meta_classifier = rf)
for clf, label in zip([clf1, clf2, clf3, clf4, sclf],
                      ['KNN', 'LR', 'Naive Bayes', 'RF', 'StackingClassifier']):
    scores = cross_val_score(clf, x_train, y_train, cv=15, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
sclf.fit(x_train,y_train)
y_pred = sclf.predict(x_test)
y_pred
sclf_score_test = sclf.score(x_test,y_test)
print(f"Test score for stacking classifier is {sclf_score_test}")
print(classification_report(y_test, y_pred))
cm=confusion_matrix(y_test, y_pred,labels=[0, 1])
df_cm = pd.DataFrame(cm, index=["No", "Yes"], columns=["No", "Yes"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g');
sclf_precision = precision_score(y_test, y_pred)
print(f"Stacking Classifier - Precision: {sclf_precision}")
sclf_recall = recall_score(y_test, y_pred)
print(f"Stacking Classifier - Recall: {sclf_recall}")
sclf_f1 = f1_score(y_test, y_pred)
print(f"Stacking Classifier - F1 Score: {sclf_f1}")
print(metrics.classification_report(y_test,y_pred,labels=[0,1]))
# ROC curve and area under the curve for StackingClassifier
stackProb=sclf.predict_proba(x_test)
fpr7, tpr7, thresholds7 = roc_curve(y_test, stackProb[:, 1])
roc_aucsc = auc(fpr7, tpr7)
print("Area under the ROC curve : %f" % roc_aucsc)
#ROC Curve for StackingClassifier
plt.figure(figsize = (12,6))
plt.style.use('seaborn-darkgrid')
plt.plot(fpr7,tpr7)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.0])
plt.title('ROC for Stacking Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
sc = pd.DataFrame({'Accuracy': [sclf_score_test],
'Precision': [sclf_precision],
'Recall': [sclf_recall],
'F1 Score': [sclf_f1],
'AUC':[roc_aucsc]},index=["Stacking Classifier"])
sc
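For reference, scikit-learn (0.22+) ships its own `StackingClassifier`, which avoids the extra `mlxtend` dependency and fits the meta-learner on out-of-fold predictions by default. A hedged sketch mirroring the stack above, shown on synthetic data (the real `x_train`/`y_train` would drop in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for a self-contained example.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", BernoulliNB()),
    ],
    final_estimator=RandomForestClassifier(random_state=1),
    cv=5,  # out-of-fold predictions feed the meta-learner
)
stack.fit(X, y)
print(stack.score(X, y))
```

The mlxtend version used above behaves similarly but, by default, trains the meta-classifier on in-sample base predictions rather than out-of-fold ones.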
models = [LR, NB, knn, RF, BC, xgb, sc]
model = pd.concat(models)
model_names = ['Logistic', 'NaiveBayes', 'kNN', 'RandomForest' , 'BaggingClassifier', 'XGBoost', 'StackingClassifier']
model
#plot to view above metrics
plt.figure(figsize=(17,15))
plt.style.use("ggplot")
plt.subplot(3,2,1)
plt.barh(model_names, model['Accuracy'], edgecolor='black', alpha=0.7)
plt.xlabel('Accuracy Comparison')
plt.subplot(3,2,2)
plt.barh(model_names, model['Precision'], edgecolor='black', alpha=0.7)
plt.xlabel('Precision Comparison')
plt.subplot(3,2,3)
plt.barh(model_names, model['Recall'], edgecolor='black', alpha=0.7)
plt.xlabel('Recall Comparison')
plt.subplot(3,2,4)
plt.barh(model_names, model['F1 Score'], edgecolor='black', alpha=0.7)
plt.xlabel('F1 Score Comparison')
plt.subplot(3,2,5)
plt.barh(model_names, model['AUC'], edgecolor='black', alpha=0.7)
plt.xlabel('ROC-AUC Comparison')
plt.show()
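The combined metrics table can also be ranked directly, which often reads more easily than the bar charts. A one-liner sketch with toy values (illustrative numbers, not the notebook's actual results):

```python
import pandas as pd

# Toy metrics table standing in for the concatenated model results.
model = pd.DataFrame(
    {"Accuracy": [0.90, 0.95], "F1 Score": [0.93, 0.96]},
    index=["Logistic Regression", "kNN"],
)
print(model.sort_values("F1 Score", ascending=False))
```

On the real `model` frame, sorting by F1 (or recall, which matters most for a screening tool where false negatives are costly) gives the final ranking at a glance.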